Sampling Rankings
Abstract
In this work, I present a recursive algorithm for computing the number of rankings consistent with a set of optimal candidates in the framework of Optimality Theory. The ability to measure this quantity, which I call the r-volume, allows a simple and effective Bayesian strategy in learning: choose the candidate preferred by a plurality of the rankings consistent with previous observations. With k constraints, this strategy is guaranteed to make fewer than k log₂(k) mistaken predictions. This improves on the k(k−1)/2 bound on mistakes for Tesar and Smolensky’s Constraint Demotion algorithm, and I show that it is within a logarithmic factor of the best possible mistake bound for learning rankings. Though the recursive algorithm is vastly better than brute-force enumeration in the vast majority of cases, the counting problem is inherently hard (#P-complete), so the worst cases will be intractable for large k. This complexity can, however, be avoided if r-volumes are estimated via sampling. In this case, though it is never computed, the r-volume of a candidate is proportional to its likelihood of being selected by a given sampled ranking. In addition to polling rankings to find candidates with maximal r-volume, sampling can be used to make predictions whose probability matches r-volume. This latter mechanism has been independently used to model linguistic variation, so the use of sampling in learning offers a formal connection between tendencies in variation and asymmetries in typological distributions. The ability to compute r-volumes makes it possible to assess this connection and to provide a precise quantitative evaluation of the sampling model of variation. The second half of the paper reviews a range of cases in which r-volume is correlated with frequency of typological attestation and frequency of use in variation.
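Since the abstract leaves the mechanics implicit, the following Python sketch illustrates the sampling idea under simplifying assumptions: each candidate's share of the rankings (its r-volume as a proportion) is estimated by drawing random total orders of the constraints and evaluating them with standard OT optimization. The function names, the toy tableau, and the uniform sampling over all k! rankings are illustrative choices, not the paper's method; the learner described in the abstract samples only from rankings consistent with previous observations and also provides an exact recursive count, neither of which is reproduced here.

import random
from collections import Counter

def ot_winner(ranking, candidates):
    # `candidates` maps a candidate name to its tuple of violation counts,
    # one count per constraint; `ranking` lists constraint indices from
    # highest-ranked to lowest-ranked.
    remaining = list(candidates)
    for c in ranking:
        fewest = min(candidates[name][c] for name in remaining)
        remaining = [name for name in remaining if candidates[name][c] == fewest]
        if len(remaining) == 1:
            break
    return remaining[0]  # if candidates tie on every constraint, return one survivor

def estimate_r_volumes(candidates, k, n_samples=10000):
    # Monte Carlo estimate of each candidate's share of the k! rankings
    # under which it is optimal, i.e. its r-volume as a proportion.
    wins = Counter()
    constraints = list(range(k))
    for _ in range(n_samples):
        ranking = random.sample(constraints, k)  # a uniform random total order
        wins[ot_winner(ranking, candidates)] += 1
    return {name: wins[name] / n_samples for name in candidates}

# Toy tableau: two candidates evaluated by three constraints.
tableau = {"cand_a": (1, 0, 0), "cand_b": (0, 1, 1)}
print(estimate_r_volumes(tableau, k=3))
# cand_a is optimal under 4 of the 6 rankings and cand_b under 2, so the
# estimates should land near {'cand_a': 0.67, 'cand_b': 0.33}; the plurality
# prediction is cand_a, while a variation model would output cand_b about a
# third of the time.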